Review of Probability

Steve Elston

09/18/2023

Importance of Probability Theory

Probability Theory Has a Long History

First probability textbook
 Credit, Wikipedia commons


Probability Distributions

Probability distributions are models for the uncertainty of random variables

A random variable, \(X\), maps each outcome, \(\omega\), in the sample space, \(\Omega\), to a real number:

\[X: \Omega \rightarrow \mathbb{R}\]

Two Types of Probability Distributions

Axioms of Probability

For discrete distributions, we can speak of a set of events within the sample space of all possible events

  1. The probability of any set of events, A, is bounded by 0 and 1:

\[0 \le P(A) \le 1 \]

  2. The probability mass function summed over the sample space must equal 1:

\[P(S) = \sum_{a_i \in A}P(a_i) = 1 \]

  3. If the sets of events A and B are mutually exclusive, the probability of either A or B is the probability of A plus the probability of B:

\[P(A \cup B) = P(A) + P(B)\quad if\ A \cap B = \emptyset\]
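As a quick sanity check, the three axioms can be verified numerically for a simple discrete distribution; a minimal sketch in Python, assuming a fair six-sided die as the example:

```python
# Hypothetical check of the three axioms for a fair six-sided die.
pmf = {face: 1 / 6 for face in range(1, 7)}

# Axiom 1: every probability lies in [0, 1]
assert all(0 <= p <= 1 for p in pmf.values())

# Axiom 2: probabilities summed over the sample space equal 1
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# Axiom 3: for mutually exclusive events A = {1, 2} and B = {5, 6},
# P(A union B) = P(A) + P(B)
A, B = {1, 2}, {5, 6}
p_union = sum(pmf[x] for x in A | B)
assert abs(p_union - (sum(pmf[x] for x in A) + sum(pmf[x] for x in B))) < 1e-12
```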

Axioms of Probability

From these three axioms we can draw some useful conclusions

What do you expect: discrete distributions

What value should we expect to find when we sample a random variable?

\[\mathrm{E}[\mathbf{X}] = \sum_{i=1}^n x_i\ p(x_i)\]
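Applying the formula above to a fair six-sided die (an assumed example) gives a quick numeric check:

```python
# Minimal sketch of the discrete expectation formula for a fair die.
xs = range(1, 7)
p = 1 / 6                              # p(x_i) is uniform for a fair die
expectation = sum(x * p for x in xs)   # E[X] = sum_i x_i p(x_i)

# The expected value of a fair die is 3.5, the center of its support
assert abs(expectation - 3.5) < 1e-9
```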

How can we interpret expectation?

Properties of Expectation

Useful properties of expectation

  1. Expectation is a linear operator

  2. The expectation of the sum of two random variables, \(X\) and \(Y\), is the sum of the expectations:

\[\mathrm{E}[\mathbf{X + Y}] = \mathrm{E}[\mathbf{X}] + \mathrm{E}[\mathbf{Y}]\]

  3. The expectation of an affine transformation of a random variable, \(X\), is an affine transformation of the expectation:

\[\mathrm{E}[a\,\mathbf{X} + b] = a\, \mathrm{E}[\mathbf{X}] + b\]
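Both properties can be demonstrated on small discrete pmfs; a sketch, with the two example distributions assumed:

```python
# Sketch: linearity and affine properties of expectation on small pmfs.
def expect(pmf):
    """E[X] = sum x p(x) over a dict of {value: probability}."""
    return sum(x * p for x, p in pmf.items())

X = {0: 0.5, 1: 0.5}            # a fair coin
Y = {1: 1/3, 2: 1/3, 3: 1/3}    # a three-outcome spinner

# E[aX + b] = a E[X] + b, checked by transforming the support directly
a, b = 3.0, 2.0
aXb = {a * x + b: p for x, p in X.items()}
assert abs(expect(aXb) - (a * expect(X) + b)) < 1e-12

# E[X + Y] = E[X] + E[Y], checked over the joint pmf (independence assumed)
joint = {(x, y): px * py for x, px in X.items() for y, py in Y.items()}
e_sum = sum((x + y) * p for (x, y), p in joint.items())
assert abs(e_sum - (expect(X) + expect(Y))) < 1e-12
```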

Axioms of probability for continuous distributions

Axioms of probability for continuous probability density function, \(f(x)\)

  1. The probability on any interval, \(\{ x_1, x_2 \}\), must be bounded by 0 and 1:

\[0 \le \int_{x_1}^{x_2} f(x)\ dx \le 1\]

Note: if \(x_1 = x_2\) the integral is 0

  2. The area under the entire PDF, integrated between its limits, must equal 1:

\[\int_{lower}^{upper} f(x)\ dx = 1\]

Note: for many distributions, lower = \(0\) or \(-\infty\) and upper = \(\infty\)

  3. If events A and B are mutually exclusive:

\[P(A \cup B) = P(A) + P(B)\quad if\ A \cap B = \emptyset\]

What do you expect: continuous distributions

Expected value with PDF \(f(x)\), over the interval, \(\{ a, b \}\):

\[\mathrm{E}[\mathbf{X}] = \int_{a}^b x\ f(x)\ dx\]
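The integral above can be evaluated numerically; a sketch assuming scipy, using the standard Normal density as the example PDF:

```python
# Sketch: E[X] by numerical integration of x f(x) (scipy assumed).
import numpy as np
from scipy import integrate, stats

# Standard Normal: the expectation should be 0
mean, _ = integrate.quad(lambda x: x * stats.norm.pdf(x), -np.inf, np.inf)
assert abs(mean - 0.0) < 1e-8

# A Normal shifted to loc = 2: the expectation should equal the location
mean2, _ = integrate.quad(lambda x: x * stats.norm.pdf(x, loc=2.0), -np.inf, np.inf)
assert abs(mean2 - 2.0) < 1e-6
```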

Bernoulli and Binomial Distributions

Bernoulli distributions model the results of a single trial or single realization with a binary outcome

\[\begin{align} P(x\ |\ p) &= \bigg\{ \begin{matrix} p \rightarrow x = 1\\ (1 - p) \rightarrow x = 0 \end{matrix}\\ or\\ P(x\ |\ p) &= p^x(1 - p)^{(1-x)},\ x \in \{0, 1\} \end{align}\]

Bernoulli and Binomial Distributions

The Binomial distribution models the number of successes, \(k\), in \(N\) independent Bernoulli trials:

\[P(k\ |\ N, p) = \binom{N}{k} p^k(1 - p)^{(N-k)}\]

The expected number of successes is:

\[\mathrm{E}[k] = p\ N\]
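A sketch checking the Binomial pmf against the closed form, assuming scipy:

```python
# Sketch: Binomial pmf and its expected count p N (scipy assumed).
from math import comb
from scipy import stats

N, p = 10, 0.3
binom = stats.binom(N, p)

# pmf matches the closed form binom(N, k) p^k (1-p)^(N-k) at k = 4
k = 4
assert abs(binom.pmf(k) - comb(N, k) * p**k * (1 - p)**(N - k)) < 1e-12

# expected number of successes is p N
assert abs(binom.mean() - p * N) < 1e-12
```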

Distributions for Multiple Outcomes; the Categorical and Multinomial Distributions

Many real-world cases have many possible outcomes

The Categorical distribution

Sample space of \(k\) possible outcomes, \(\mathcal{X} = (e_1,e_2, \ldots, e_k)\).

\[\mathbf{e_i} = (0, 0, \ldots, 1, \ldots, 0)\]

\[\begin{align} \Pi &= (\pi_1, \pi_2, \ldots, \pi_k) \\ \\ with\ \sum_{i}\pi_i &= 1 \end{align}\]

The Categorical distribution

And consequently, we can write the simple probability mass function as:

\[f(x_i| \Pi) = \pi_i\]

For a series of \(N\) trials we can estimate each of the probabilities of the possible outcomes, \((\pi_1, \pi_2, \ldots, \pi_k)\):

\[\hat{\pi}_i = \frac{\#\ e_i}{N}\]

Where \(\#\ e_i\) is the count of outcome \(e_i\). The expected count of outcome \(e_i\) is:

\[\mathrm{E}[\#\ e_i] = \pi_i\ N\]
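The count-based estimate above can be simulated; a sketch assuming numpy and an arbitrary three-outcome distribution:

```python
# Sketch: estimating Categorical probabilities from counts (numpy assumed).
import numpy as np

rng = np.random.default_rng(42)
Pi = np.array([0.2, 0.5, 0.3])       # assumed true outcome probabilities
N = 100_000
draws = rng.choice(3, size=N, p=Pi)  # outcomes coded 0, 1, 2

counts = np.bincount(draws, minlength=3)
pi_hat = counts / N                  # pi_hat_i = (# e_i) / N

# estimates are close to the true probabilities for large N
assert np.allclose(pi_hat, Pi, atol=0.01)
```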

The Categorical distribution

For the case of \(k=3\) you can visualize the possible outcomes of a single Categorical trial


Simplex for \(Mult_3\)

Poisson distribution

Poisson distribution models the probability, \(P\), of \(x\) arrivals within a fixed time period, given the arrival rate \(\lambda\)

\[ P(x\ |\ \lambda) = \frac{\lambda^x}{x!} e^{-\lambda} \]

The mean and variance of the Poisson distribution are both equal to the parameter \(\lambda\), or:

\[\begin{align} Mean &= \lambda\\ Variance &= \lambda \end{align}\]
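A sketch confirming the mean, variance, and pmf of the Poisson distribution, assuming scipy:

```python
# Sketch: Poisson mean and variance both equal lambda (scipy assumed).
from math import exp, factorial
from scipy import stats

lam = 4.0
pois = stats.poisson(lam)
assert abs(pois.mean() - lam) < 1e-12
assert abs(pois.var() - lam) < 1e-12

# pmf at x = 3 matches lambda^x e^{-lambda} / x!
x = 3
assert abs(pois.pmf(x) - lam**x * exp(-lam) / factorial(x)) < 1e-12
```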

Poisson distribution

Poisson distribution for several arrival rates

Uniform distribution

Uniform distribution has flat PDF between limits \(\{ a, b \}\)

Write the probability of the Uniform distribution as:

\[ P(x\ | \{a,b \}) = \Bigg\{ \begin{matrix} \frac{1}{(b - a)} & if\ a \le x \le b \\ 0 & otherwise \end{matrix} \]

Uniform distribution has the following properties:

\[\begin{align} Mean &= \frac{(a + b)}{2}\\ Variance &= \frac{1}{12}(b - a)^2 \end{align}\]

Uniform distribution

The expectation of a uniform distribution on the interval \(\{ a,b \}\) is easy to work out:

\[\begin{align} \mathrm{E}_{a,b}(\mathbf{X}) &= \int_a^b x\ p(x)\ dx \\ &= \frac{1}{b - a}\int_a^b x\ dx \\ &= \frac{1}{b - a}\ \frac{x^2}{2}\ \big\rvert_a^b \\ &= \frac{b^2 - a^2}{2(b - a)} \\ &= \frac{a+b}{2} \end{align}\]

Which is just the mean.
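A sketch checking the Uniform mean and variance formulas, assuming scipy and an arbitrary interval:

```python
# Sketch: mean and variance of Uniform(a, b) (scipy assumed).
from scipy import stats

a, b = 2.0, 6.0
# scipy parameterizes the Uniform by loc and scale: loc = a, scale = b - a
u = stats.uniform(loc=a, scale=b - a)

assert abs(u.mean() - (a + b) / 2) < 1e-12          # (a + b) / 2
assert abs(u.var() - (b - a) ** 2 / 12) < 1e-12     # (b - a)^2 / 12
```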

Normal distribution

The Normal distribution or Gaussian distribution is one of the most widely used probability distributions

Normal distribution

For a univariate Normal distribution we can write the density function as:

\[P(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{\frac{-(x - \mu)^2}{2 \sigma^2}}\]

The parameters can be interpreted as:

\[\begin{align} \mu &= location\ parameter = mean \\ \sigma &= scale = standard\ deviation \\ \sigma^2 &= Variance \end{align}\]
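A sketch comparing scipy's Normal density against the closed form, with arbitrary parameter values assumed:

```python
# Sketch: univariate Normal density vs. the closed form (scipy assumed).
import numpy as np
from scipy import stats

mu, sigma = 1.0, 2.0
x = 0.5
closed_form = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
assert abs(stats.norm.pdf(x, loc=mu, scale=sigma) - closed_form) < 1e-12

# about 68% of the probability mass lies within one standard deviation
mass = stats.norm.cdf(mu + sigma, mu, sigma) - stats.norm.cdf(mu - sigma, mu, sigma)
assert abs(mass - 0.6827) < 1e-3
```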

Normal distribution

Normal density for several parameter values

Multivariate Normal

Many practical applications have an \(n\)-dimensional parameter vector in \(\mathbb{R}^n\), requiring multivariate distributions

\[f(\vec{\mathbf{x}}) = \frac{1}{{\sqrt{(2 \pi)^n |\mathbf{\Sigma}|}}} \exp \big( -\frac{1}{2} (\vec{\mathbf{x}} - \vec{\mathbf{\mu}})^T \mathbf{\Sigma}^{-1} (\vec{\mathbf{x}} - \vec{\mathbf{\mu}})\big)\]

Multivariate Normal

We can write the covariance matrix:

\[ \mathbf{\Sigma} = \begin{bmatrix} \sigma_{1,1} & \sigma_{1,2} & \ldots & \sigma_{1,n} \\ \sigma_{2,1} & \sigma_{2,2} & \ldots & \sigma_{2,n} \\ \vdots & \vdots & \vdots & \vdots \\ \sigma_{n,1} & \sigma_{n,2} & \ldots & \sigma_{n,n} \\ \end{bmatrix} \]

For a Normally distributed n-dimensional multivariate random variable, \(\sigma_{i,j}\) computed from the sample, \(\mathbf{X}\):

\[\begin{align} \sigma_{i,j} &= \mathrm{E} \big[ (\vec{x}_i - \mathrm{E}[\vec{x}_i]) \cdot (\vec{x}_j - \mathrm{E}[\vec{x}_j]) \big] \\ &= \mathrm{E} \big[ (\vec{x}_i - \bar{x}_i) \cdot (\vec{x}_j - \bar{x}_j) \big] \\ &= \frac{1}{k}(\vec{x}_i - \bar{x}_i) \cdot (\vec{x}_j - \bar{x}_j) \end{align}\]

Where \(\cdot\) is the inner product operator, \(\bar{x}_i\) is the mean of \(\vec{x}_i\), and \(k\) is the number of samples.
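A sketch estimating the covariance matrix from samples of a 2-dimensional Normal, assuming numpy and an arbitrary covariance:

```python
# Sketch: sample covariance matrix of a 2-D Normal (numpy assumed).
import numpy as np

rng = np.random.default_rng(0)
mu = [0.0, 0.0]
Sigma = [[1.0, 0.5],
         [0.5, 1.0]]
X = rng.multivariate_normal(mu, Sigma, size=200_000)

# np.cov expects variables in rows, samples in columns
Sigma_hat = np.cov(X.T)
assert np.allclose(Sigma_hat, Sigma, atol=0.02)
```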

Multivariate Normal

2-dimensional Normal with \(\mu = [0,0]\) and \(\Sigma = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix}\)

Multivariate Normal

2-dimensional Normal with \(\mu = [0,0]\) and \(\Sigma = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 0.5 \end{bmatrix}\)

Multivariate Normal

2-dimensional Normal with \(\mu = [0,0]\) and \(\Sigma = \begin{bmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix}\)

Multivariate Normal

2-dimensional Normal with \(\mu = [0,0]\) and \(\Sigma = \begin{bmatrix} 1.0 & -0.5 \\ -0.5 & 1.0 \end{bmatrix}\)

Multivariate Normal

2-dimensional Normal with \(\mu = [0,0]\) and \(\Sigma = \begin{bmatrix} 1.0 & 0.9 \\ 0.9 & 1.0 \end{bmatrix}\)

Log-Normal distribution

Log-Normal distribution is defined for continuous random variables in the range \(0 < x < \infty\)
- Examples: price, weight, length, and volume

\[P(x) = \frac{1}{x} \frac{1}{\sigma \sqrt{2 \pi}} \exp{\frac{-(\log(x) - \mu)^2}{2 \sigma^2}}\]
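The defining property, that the log of a Log-Normal variable is Normal, can be simulated; a sketch assuming numpy and arbitrary parameters:

```python
# Sketch: the log of a Log-Normal variable is Normally distributed (numpy assumed).
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.5, 0.8
x = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

# after the log transform, the sample mean and std match mu and sigma
log_x = np.log(x)
assert abs(log_x.mean() - mu) < 0.01
assert abs(log_x.std() - sigma) < 0.01
```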

Log-Normal distribution

Log Normal and log transformed example


Student t-distribution

Student t-distribution, or simply the t-distribution, is of importance in statistics since it is the distribution of the sample mean of Normally distributed random variables when standardized by the estimated standard deviation

\[ P(x\ |\ \nu) = \frac{\Gamma(\frac{\nu + 1}{2})}{\sqrt{\nu \pi} \Gamma(\frac{\nu}{2})} \bigg(1 + \frac{x^2}{\nu} \bigg)^{- \frac{\nu + 1}{2}}\\ where\\ \Gamma(x) = Gamma\ function \]

Student t-distribution

Dispersion of the Student-t distribution is determined by the DOF, \(\nu\)
- Low DOF gives heavy tails compared to the Normal
- Student-t \(\rightarrow\) standard Normal as \(DOF \rightarrow \infty\)
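Both bullet points can be checked numerically; a sketch assuming scipy, comparing upper-tail probabilities at an arbitrary point:

```python
# Sketch: Student-t tail behavior vs. the standard Normal (scipy assumed).
from scipy import stats

# heavier tails at low degrees of freedom: more mass beyond x = 3
assert stats.t.sf(3.0, df=3) > stats.norm.sf(3.0)

# converges toward the standard Normal as the DOF grows
assert abs(stats.t.sf(3.0, df=1000) - stats.norm.sf(3.0)) < 1e-4
```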


The Gamma and \(\chi^2\) distributions

Gamma family of distributions includes several members which are important in statistics

The Gamma and \(\chi^2\) distributions

Gamma family can be parameterized in several ways; we will use:
- A shape parameter, \(\nu\), the degrees of freedom
- A scale parameter, \(\sigma\)

\[ Gam(x\ |\ \nu,\sigma)=\frac{x^{\nu-1}\ e^{-x/\sigma}}{\sigma^\nu\ \Gamma(\nu)}\\ where\\ x \ge 0,\ \nu > 0,\ \sigma > 0\\ and\\ \Gamma(\nu) = Gamma\ function \]
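A sketch checking the Gamma density above against scipy's implementation, with arbitrary shape and scale values assumed:

```python
# Sketch: Gamma(nu, sigma) density vs. the closed form (scipy assumed).
from math import exp, gamma as gamma_fn
from scipy import stats

nu, sigma = 2.5, 1.5   # shape (degrees of freedom) and scale
x = 3.0
closed_form = x**(nu - 1) * exp(-x / sigma) / (sigma**nu * gamma_fn(nu))

# scipy's `a` is the shape parameter and `scale` is sigma
assert abs(stats.gamma.pdf(x, a=nu, scale=sigma) - closed_form) < 1e-12
```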

The Gamma and \(\chi^2\) distributions

Two useful special cases of the Gamma distribution are the Exponential distribution (\(\nu = 1\)) and the \(\chi^2\) distribution (\(\nu = k/2,\ \sigma = 2\)).

The \(\chi^2\) Distribution

The \(\chi^2\) distribution is used to construct parametric hypothesis tests of differences in counts between groups

The \(\chi^2\) Distribution

The \(\chi^2\) distribution is a parametric distribution with a single parameter, the degrees of freedom, \(k\); for a test of counts, k = number of possible outcomes - 1.

The sum of the squares of \(k\) standard Normal random variables, \(Z_i\), follows a \(\chi^2\) distribution with \(k\) degrees of freedom:

\[Q = \sum_{i=1}^k Z_i^2\]

\[Q \sim \chi^2_k = \chi^2(k)\]
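The sum-of-squared-Normals construction can be simulated directly; a sketch assuming numpy and scipy:

```python
# Sketch: chi-squared as a sum of k squared standard Normals (numpy/scipy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k = 5
Z = rng.standard_normal(size=(500_000, k))
Q = (Z**2).sum(axis=1)         # each row sums k squared standard Normals

# empirical moments match chi^2_k: mean = k, variance = 2k
assert abs(Q.mean() - k) < 0.05
assert abs(Q.var() - 2 * k) < 0.2

# the empirical CDF agrees with scipy's chi2 at one point
assert abs((Q <= k).mean() - stats.chi2.cdf(k, df=k)) < 0.01
```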

The \(\chi^2\) Distribution

The shape of the \(\chi^2\) distribution changes character with the DoF: For \(k = 1\ or\ 2\) the \(\chi^2\) distribution has an exponential decay with the maximum value at \(x=0\)


The \(\chi^2\) Distribution

The shape of the \(\chi^2\) distribution changes character with the DoF: For a middle range of DoF values the density starts at 0, rises to a maximum, or mode, and then decays back toward 0


The \(\chi^2\) Distribution

The shape of the \(\chi^2\) distribution changes character with the DoF: For large values of DoF the \(\chi^2\) distribution converges toward a Normal distribution with mean equal to the DoF, \(k\)


Odds

Odds are the ratio of the number of ways an event occurs to the number of ways it does not occur

Odds

What is the relationship between odds and probability of an event?

\[P(A) = \frac{A}{S} = \frac{A}{A + (S - A)} = \frac{A}{A + B} = \frac{count\ in\ favor}{count\ in\ favor\ + count\ not\ in\ favor}\ \\ which\ implies\\ odds = A:(S-A)\]

For example, a fair coin has 1:1 odds of heads, so:

\[P(H) = \frac{1}{1 + 1} = \frac{1}{2}\]
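The odds-to-probability relationship above can be sketched as a pair of small helper functions (hypothetical names, not from the original):

```python
# Sketch: converting between odds and probability (hypothetical helpers).
def odds_to_prob(a, b):
    """Odds a:b in favor of an event -> probability, a / (a + b)."""
    return a / (a + b)

def prob_to_odds(p):
    """Probability -> odds ratio in favor, p / (1 - p)."""
    return p / (1 - p)

assert odds_to_prob(1, 1) == 0.5     # a fair coin: odds 1:1 gives P(H) = 1/2
assert odds_to_prob(3, 1) == 0.75    # odds 3:1 in favor
assert abs(prob_to_odds(0.75) - 3.0) < 1e-12
```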

Summary

\[0 \le P(A) \le 1 \] \[P(S) = \sum_{a_i \in A}P(a_i) = 1 \] \[P(A \cup B) = P(A) + P(B)\quad if\ A \cap B = \emptyset\]

\[\mathrm{E}[\mathbf{X}] = \sum_{i=1}^n x_i\ p(x_i)\]

Summary

\[f(x_i| \Pi) = \pi_i\]